MANIFOLD ESTIMATION AND SINGULAR DECONVOLUTION

We find lower and upper bounds for the risk of estimating a man- 
ifold in Hausdorff distance under several models. We also show that 
there are close connections between manifold estimation and the 
problem of deconvolving a singular measure. 

1. Introduction. Manifold learning is an area of intense research activity 
in machine learning and statistics. Yet a very basic question about manifold 
learning is still open, namely, how well can we estimate a manifold from n 
noisy samples? In this paper we investigate this question under various as- 
sumptions. 

Suppose we observe a random sample $Y_1, \ldots, Y_n \in \mathbb{R}^D$ that lies on or
near a d-manifold M where d < D. The question we address is: what is
the minimax risk under Hausdorff distance for estimating M? Our main
assumption is that M is a d-dimensional, smooth Riemannian submanifold
in $\mathbb{R}^D$; the precise conditions on M are given in Section 2.

Let Q denote the distribution of $Y_i$. We shall see that Q depends on
several things, including the manifold M, a distribution G supported on M
and a model for the noise. We consider three noise models. The first is the
noiseless model in which $Y_1, \ldots, Y_n$ is a random sample from G. The second
is the clutter noise model, in which

(1) $Y_1, \ldots, Y_n \sim (1-\pi)U + \pi G$,

where U is a uniform distribution on a compact set $\mathcal{K} \subset \mathbb{R}^D$ with nonempty
interior, and G is supported on M. (When $\pi = 1$ we recover the noiseless
case.) The third is the additive model,

(2) $Y_i = X_i + Z_i$,

where $X_1, \ldots, X_n \sim G$, G is supported on M, and the noise variables $Z_1, \ldots, Z_n$
are a sample from a distribution $\Phi$ on $\mathbb{R}^D$ which we take to be Gaussian.
In this case, the distribution Q of $Y_i$ is a convolution of G and $\Phi$, written
$Q = G \star \Phi$.

In a previous paper [Genovese et al. (2010)], we considered a noise model 
in which the noise is perpendicular to the manifold. This model is also 
considered in Niyogi, Smale and Weinberger (2011). Since we have already 
studied that model, we shall not consider it further here. 

In the additive model, estimating M is related to estimating the distri- 
bution G, a problem that is usually called deconvolution [Fan (1991)]. The 
problem of deconvolution is well studied in the statistical literature, but in 
the manifold case there is an interesting complication: the measure G is singular
because it puts all its mass on a subset of $\mathbb{R}^D$ that has zero Lebesgue
measure (since the manifold has dimension d < D). Deconvolution of singular
measures has not received as much attention as standard deconvolution
problems and raises interesting challenges.

Each noise model gives rise to a class of distributions $\mathcal{Q}$ for Y defined
more precisely in Section 2. We are interested in the minimax risk

(3) $R_n = R_n(\mathcal{Q}) = \inf_{\hat{M}} \sup_{Q \in \mathcal{Q}} \mathbb{E}_Q[H(\hat{M}, M)]$,

where the infimum is over all estimators $\hat{M}$, and H is the Hausdorff distance
[defined in equation (4)]. Note that finding the minimax risk is equivalent
to finding the sample complexity $n(\varepsilon) = \inf\{n : R_n \le \varepsilon\}$. We emphasize that
the goal of this paper is to find the minimax rates, not to find practical 
estimators. We use the Hausdorff distance because it is one of the most 
commonly used metrics for assessing the accuracy of set-valued estimators. 
One could of course create other loss functions and study their properties, 
but this is beyond the scope of this paper. Finally, we remark that our upper 
bounds sometimes differ from our lower bounds by a logarithmic factor. This 
is a common phenomenon when dealing with Hausdorff distance (and sup 
norm in function estimation problems). Currently, we do not know how to 
eliminate the log factor. 

1.1. Related work. In the additive noise case, estimating a manifold is 
related to deconvolution problems such as those in Fan (1991), Fan and 
Truong (1993) and Stefanski (1990). More closely related is the problem of 
estimating the support of a distribution in the presence of noise as discussed, 
for example, in Meister (2006).

There is a vast literature on manifold estimation. Much of the literature
deals with using manifolds for the purpose of dimension reduction. See, for 
example, Baraniuk and Wakin (2009) and references therein. We are inter- 
ested instead in actually estimating the manifold itself. There is a literature 
on this problem in the field of computational geometry; see Dey (2007). 
However, very few papers allow for noise in the statistical sense, by which 
we mean observations drawn randomly from a distribution. In the literature 
on computational geometry, observations are called noisy if they depart from 
the underlying manifold in a very specific way: the observations have to be 
close to the manifold but not too close to each other. This notion of noise 
is quite different from random sampling from a distribution. An exception
is Niyogi, Smale and Weinberger (2008), who constructed the following estimator:
Let $I = \{i : \hat{p}(Y_i) > \lambda\}$ where $\hat{p}$ is a density estimator. They define
$\hat{M} = \bigcup_{i \in I} B_D(Y_i, \varepsilon)$ where $B_D(Y_i, \varepsilon)$ is a ball in $\mathbb{R}^D$ of radius $\varepsilon$ centered
at $Y_i$. Niyogi, Smale and Weinberger (2008) show that if $\lambda$ and $\varepsilon$ are chosen
properly, then $\hat{M}$ is homologous to M. This means that M and $\hat{M}$ share
certain topological properties. However, the result does not guarantee closeness
in Hausdorff distance. A very relevant paper is Caillerie et al. (2011).
These authors consider observations generated from a manifold and then
contaminated by additive noise as we do in Section 5. Also, they use deconvolution
methods as we do. However, their interest is in upper bounding
the Wasserstein distance between an estimator $\hat{G}$ and the distribution G, as
a prelude to estimating the homology of M. They do not establish Hausdorff
bounds. Koltchinskii (2000) considers estimating the number of connected
components of a set, contaminated by additive noise. This corresponds to
estimating the zeroth order homology.

There is also a literature on estimating principal surfaces. A recent paper
on this approach with an excellent review is Ozertem and Erdogmus (2011). 
This is similar to estimating manifolds but, to the best of our knowledge, 
this literature does not establish minimax bounds for estimation in Hausdorff 
distance. Finally we would like to mention the related problem of testing for 
a set of points on a surface in a field of uniform noise [Arias-Castro et al. 
(2005)], but, despite some similarity, this problem is quite different. 

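1.2. Notation. We let $B_D(x, r)$ denote a D-dimensional open ball centered
at x with radius r. If A is a set, and x is a point, then we write
$d(x, A) = \inf_{y \in A} \|x - y\|$ where $\|\cdot\|$ is the Euclidean norm. Given two sets A
and B, the Hausdorff distance between A and B is

(4) $H(A, B) = \inf\{\varepsilon : A \subset B \oplus \varepsilon \text{ and } B \subset A \oplus \varepsilon\}$,

where

(5) $A \oplus \varepsilon = \bigcup_{x \in A} B_D(x, \varepsilon)$.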
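For finite point sets, the Hausdorff distance reduces to the larger of the two directed distances $\sup_{x \in A} d(x, B)$ and $\sup_{y \in B} d(y, A)$. The following is a minimal sketch under that assumption; the function name `hausdorff_distance` and the discretized circles are illustrative choices, not part of the paper.

```python
import numpy as np

def hausdorff_distance(a, b):
    """Hausdorff distance between two finite point sets (rows are points)."""
    a = np.atleast_2d(np.asarray(a, dtype=float))
    b = np.atleast_2d(np.asarray(b, dtype=float))
    # Pairwise Euclidean distances, shape (len(a), len(b)).
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    # Directed distances: sup over a of d(x, b) and sup over b of d(y, a).
    return max(dists.min(axis=1).max(), dists.min(axis=0).max())

# Illustration: a discretized unit circle and a copy translated by 0.1.
theta = np.linspace(0.0, 2.0 * np.pi, 400, endpoint=False)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
shifted = circle + np.array([0.1, 0.0])
print(hausdorff_distance(circle, shifted))
```

For continuous sets the same quantity would have to be approximated, for example by discretizing both sets finely as done here.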
The $L_1$ distance between two distributions P and Q with densities p and q
is $\ell_1(p, q) = \int |p - q|$ and the total variation distance between P and Q is

(6) $\mathrm{TV}(P, Q) = \sup_A |P(A) - Q(A)|$,

where the supremum is over all measurable sets A. Recall that
$\mathrm{TV}(P, Q) = (1/2)\ell_1(p, q)$.
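The identity between the supremum over sets and half the $L_1$ distance can be checked by brute force on a small discrete sample space. The sketch below assumes probability vectors on a finite set; the helper names `tv_by_sets` and `tv_by_l1` are hypothetical.

```python
from itertools import chain, combinations

import numpy as np

def tv_by_sets(p, q):
    """Total variation as the supremum of |P(A) - Q(A)| over all subsets A."""
    idx = range(len(p))
    subsets = chain.from_iterable(combinations(idx, r) for r in range(len(p) + 1))
    return max(abs(sum(p[i] - q[i] for i in s)) for s in subsets)

def tv_by_l1(p, q):
    """Total variation as half the L1 distance between the mass functions."""
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
print(tv_by_sets(p, q), tv_by_l1(p, q))
```

The subset enumeration is exponential in the size of the sample space, so it is only meant as a sanity check.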

Let $p(x) \wedge q(x) = \min\{p(x), q(x)\}$. The affinity between P and Q is

(7) $\|P \wedge Q\| = \int p \wedge q = 1 - \frac{1}{2} \int |p - q|$.

Let $P^n$ denote the n-fold product measure based on n independent observations
from P. It can be shown that

(8) \\P''f\Q^>\(i-\j\p 

The convolution between two measures P and $\Phi$, denoted by $P \star \Phi$, is the
measure defined by

(9) $(P \star \Phi)(A) = \int \Phi(A - x)\, dP(x)$.

If $\Phi$ has density $\phi$, then $P \star \Phi$ has density $\int \phi(y - u)\, dP(u)$. The Fourier
transform of P is denoted by

(10) $p^*(t) = \int e^{i t^T u}\, dP(u) = \int e^{i t \cdot u}\, dP(u)$,

where we use both $t^T u$ and $t \cdot u$ to denote the dot product.

We write $X_n = O_P(a_n)$ to mean that for every $\varepsilon > 0$, there exists $C > 0$
such that $P(\|X_n\|/a_n > C) \le \varepsilon$ for all large n. Throughout, we use symbols
like $C, C_0, C_1, c, c_0, c_1, \ldots$ to denote generic positive constants whose value
may be different in different expressions. We write $\mathrm{poly}(\varepsilon)$ to denote any
expression of the form $a\varepsilon^b$ for some positive real numbers a and b. We write
$a_n \preceq b_n$ if there exists $c > 0$ such that $a_n \le c\, b_n$ for all large n. Similarly,
write $a_n \succeq b_n$ if $b_n \preceq a_n$. Finally, write $a_n \asymp b_n$ if $a_n \preceq b_n$ and $b_n \preceq a_n$.

We will use Le Cam's lemma to derive lower bounds, which we now state. 
This version is from Yu (1997). 

Lemma 1 (Le Cam 1973). Let $\mathcal{Q}$ be a set of distributions. Let $\theta(Q)$
take values in a metric space with metric $\rho$. Let $Q_0, Q_1 \in \mathcal{Q}$ be any pair
of distributions in $\mathcal{Q}$. Let $Y_1, \ldots, Y_n$ be drawn i.i.d. from some $Q \in \mathcal{Q}$ and
denote the corresponding product measure by $Q^n$. Let $\hat{\theta} = \hat{\theta}(Y_1, \ldots, Y_n)$ be
any estimator. Then

sup Egn [p{e, e{Q))] > p{eiQo),eiQimQ^ a 

> p{eiQo),eiQi))l{i - TV(go,Qi))'". 
2. Assumptions. We shall be concerned with d-dimensional Riemannian
submanifolds of $\mathbb{R}^D$ where d < D. Usually, we assume that M is contained
in some compact set $\mathcal{K} \subset \mathbb{R}^D$. An exception is Section 5 where we allow
noncompact manifolds. Let $\Delta(M)$ be the largest r such that each point in
$M \oplus r$ has a unique projection onto M. The quantity $\Delta(M)$ will be small
if either M is not smooth or if M is close to being self-intersecting. The
quantity $\Delta(M)$ has been rediscovered many times. It is called the condition
number in Niyogi, Smale and Weinberger (2008) and the reach in Federer
(1959). Let $\mathcal{M}(\kappa)$ denote all d-dimensional manifolds embedded in $\mathbb{R}^D$ such
that $\Delta(M) \ge \kappa$. Throughout this paper, $\kappa$ is a fixed positive constant.

We consider three different distributional models: 

(1) Noiseless. We observe $Y_1, \ldots, Y_n \sim G$ where G is supported on a manifold
M where $M \in \mathcal{M} = \{M \in \mathcal{M}(\kappa), M \subset \mathcal{K}\}$. In this case, Q = G and the
observed data fall exactly on the manifold. We assume that G has density g
with respect to the uniform distribution on M and that

(11) $0 < b(\mathcal{M}) \le \inf_{y \in M} g(y) \le \sup_{y \in M} g(y) \le B(\mathcal{M}) < \infty$,

where $b(\mathcal{M})$ and $B(\mathcal{M})$ are allowed to depend on the class $\mathcal{M}$, but not on
the particular manifold M. Let $\mathcal{G}(M)$ denote all such distributions. In this
case we define

(12) $\mathcal{Q} = \mathcal{G} = \bigcup_{M \in \mathcal{M}} \mathcal{G}(M)$.

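(2) Clutter noise. Define $\mathcal{M}$ and $\mathcal{G}(M)$ as in the noiseless case. We observe

(13) $Y_1, \ldots, Y_n \sim Q = (1-\pi)U + \pi G$,

where $0 < \pi \le 1$, U is uniform on the compact set $\mathcal{K} \subset \mathbb{R}^D$ and $G \in \mathcal{G}(M)$.
Define

(14) $\mathcal{Q} = \{Q = (1-\pi)U + \pi G : G \in \mathcal{G}(M), M \in \mathcal{M}\}$.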

(3) Additive noise. In this case we allow the manifolds to be noncompact.
However, we do require that each G put nontrivial probability in some fixed
compact set. Specifically, we again fix a compact set $\mathcal{K}$. Let $\mathcal{M} = \mathcal{M}(\kappa)$. Fix
positive constants $0 < b(\mathcal{M}) < B(\mathcal{M}) < \infty$. For any $M \in \mathcal{M}$, let $\mathcal{G}(M)$ be
the set of distributions G supported on M, such that G has density g with
respect to Hausdorff measure on M, and such that

(15) $0 < b(\mathcal{M}) \le \inf_{y \in M \cap \mathcal{K}} g(y) \le \sup_{y \in M \cap \mathcal{K}} g(y) \le B(\mathcal{M}) < \infty$.

Let $X_1, X_2, \ldots, X_n \sim G \in \mathcal{G}(M)$, and define

(16) $Y_i = X_i + Z_i$, $i = 1, \ldots, n$,

where $Z_i$ are i.i.d. draws from a distribution $\Phi$ on $\mathbb{R}^D$, and where $\Phi$ is
a standard D-dimensional Gaussian. Let $Q = G \star \Phi$ be the distribution of
each $Y_i$ and $Q^n$ be the corresponding product measure. Let
$\mathcal{Q} = \{G \star \Phi : G \in \mathcal{G}(M), M \in \mathcal{M}\}$.
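To make the three sampling models concrete, the following sketch draws data in the special case D = 2, d = 1 with M the unit circle and G uniform on M. The choice of manifold, the square $[-2, 2]^2$ standing in for $\mathcal{K}$, and all function names are assumptions made for illustration only.

```python
import numpy as np

def sample_circle(n, rng):
    """Draw n points from the uniform distribution G on the unit circle."""
    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    return np.column_stack([np.cos(theta), np.sin(theta)])

def sample_noiseless(n, rng):
    """Noiseless model: Y_i ~ G."""
    return sample_circle(n, rng)

def sample_clutter(n, pi, rng, half_width=2.0):
    """Clutter model: Y_i ~ (1 - pi) U + pi G, with U uniform on a square."""
    on_manifold = rng.random(n) < pi
    y = rng.uniform(-half_width, half_width, size=(n, 2))
    y[on_manifold] = sample_circle(int(on_manifold.sum()), rng)
    return y

def sample_additive(n, rng):
    """Additive model: Y_i = X_i + Z_i with standard Gaussian Z_i."""
    return sample_circle(n, rng) + rng.standard_normal((n, 2))

rng = np.random.default_rng(0)
print(sample_noiseless(3, rng))
print(sample_clutter(3, 0.5, rng))
print(sample_additive(3, rng))
```

Setting `pi=1.0` in `sample_clutter` recovers the noiseless model, matching the remark after (1).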

These three models are an attempt to capture the idea that we have data 
falling on or near a manifold. These appear to be the most commonly used 
models. No doubt, one could create other models as well which is a topic for 
future research. As we mentioned earlier, a different noise model is consid- 
ered in Niyogi, Smale and Weinberger (2011) and in Genovese et al. (2010). 
Those authors consider the case where the noise is perpendicular to the 
manifold. The former paper considers estimating the homology groups of M
while the latter paper shows that the minimax Hausdorff rate is $n^{-2/(2+d)}$
in that case.

3. Noiseless case. We now derive the minimax bounds in the noiseless 
case. 

Theorem 2. Under the noiseless model, we have

(17) $\inf_{\hat{M}} \sup_{Q \in \mathcal{Q}} \mathbb{E}_{Q^n}[H(\hat{M}, M)] \ge C n^{-2/d}$.

Proof. Fix $\gamma > 0$. By Theorem 6 of Genovese et al. (2010) there exist
manifolds $M_0, M_1$ that satisfy the following conditions:

(1) $M_0, M_1 \in \mathcal{M}$.

(2) $H(M_0, M_1) = \gamma$.

(3) There is a set $B \subset M_1$ such that:

(a) $\inf_{y \in M_0} \|x - y\| > \gamma/2$ for all $x \in B$.

(b) $\mu_1(B) \ge \gamma^{d/2}$ where $\mu_1$ is the uniform measure on $M_1$.

(c) There is a point $x \in B$ such that $\|x - y\| = \gamma$ where $y \in M_0$ is the
closest point on $M_0$ to x. Moreover, $T_x M_1$ and $T_y M_0$ are parallel
where $T_x M$ is the tangent plane to M at x.

(4) If $A = \{y : y \in M_1, y \notin M_0\}$, then $\mu_1(A) \le C\gamma^{d/2}$ for some C > 0.

Let $Q_i = G_i$ be the uniform measure on $M_i$, for i = 0, 1, and let A be the
set defined in the last item. Then $\mathrm{TV}(G_0, G_1) = G_1(A) - G_0(A) = G_1(A) \le C\gamma^{d/2}$.
From Le Cam's lemma,

(18) $\sup_{Q \in \mathcal{Q}} \mathbb{E}_{Q^n} H(\hat{M}, M) \ge \gamma(1 - \gamma^{d/2})^{2n}$.

Setting $\gamma = (1/n)^{2/d}$ yields the stated lower bound. □
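The last step of the proof can be checked numerically: with $\gamma = n^{-2/d}$ we have $\gamma^{d/2} = 1/n$, so $(1 - \gamma^{d/2})^{2n}$ tends to $e^{-2}$ and the bound scales as $n^{-2/d}$. The sketch below evaluates the right-hand side of (18) with all generic constants set to 1, which is an assumption made purely for illustration; the function name is hypothetical.

```python
import math

def le_cam_bound(n, d):
    """Right-hand side of (18) at gamma = n**(-2/d), generic constants set to 1."""
    gamma = n ** (-2.0 / d)
    return gamma * (1.0 - gamma ** (d / 2.0)) ** (2 * n)

for d in (1, 2, 3):
    for n in (10**2, 10**4, 10**6):
        ratio = le_cam_bound(n, d) / n ** (-2.0 / d)
        # The ratio approaches exp(-2) as n grows.
        print(d, n, le_cam_bound(n, d), ratio)
```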

See Figure 1 for a heuristic explanation of the construction of the two
manifolds, $M_0$ and $M_1$, used in the above proof. Now we derive an upper
bound.

Fig. 1. The proof of Theorem 2 uses two manifolds, $M_0$ and $M_1$. A sphere of radius $\kappa$
is pushed upward into the plane $M_0$ (top left). The resulting manifold $M_0'$ is not smooth
(top right). A sphere is then rolled around the manifold (bottom left) to produce a smooth
manifold $M_1$ (bottom right). The construction is made rigorous in Theorem 6 of Genovese
et al. (2010).



Theorem 3. Under the noiseless model, we have

(19) $\inf_{\hat{M}} \sup_{Q \in \mathcal{Q}} \mathbb{E}_{Q^n}[H(\hat{M}, M)] \le C \left(\frac{\log n}{n}\right)^{2/d}$.

Hence, the rate is tight, up to logarithmic factors. The proof is a special 
case of the proof of the upper bound in the next section and so is omitted. 

Remark. The Associate Editor pointed out that the rate $(1/n)^{2/d}$ might
seem counterintuitive. For example, when d = 1, this yields $(1/n)^2$ which
would seem to contradict the usual 1/n rate for estimating the support of
a uniform distribution. However, the slower 1/n rate is actually a boundary
effect much like the boundary effects that occur in density estimation and
regression. If we embed the uniform into $\mathbb{R}^2$ and wrap it into a circle to
eliminate the boundary, we do indeed get a rate of $1/n^2$. Our assumption of
smooth manifolds without boundary removes the boundary effect.

4. Clutter noise. Recall that

$Y_1, \ldots, Y_n \sim Q = (1-\pi)U + \pi G$,

where U is uniform on $\mathcal{K}$, $0 < \pi \le 1$ and $G \in \mathcal{G}$.




8 GENOVESE, PERONE-PACIFICO, VERDINELLI AND WASSERMAN 



Fig. 2. Given a manifold M and a point $y \in M$, $S_M(y)$ is a slab, centered at y, with size
$O(\sqrt{\varepsilon_n})$ in the d directions corresponding to the tangent space $T_y M$ and size $O(\varepsilon_n)$ in the
D − d normal directions.

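Theorem 4. Under the clutter model, we have

(20) $\inf_{\hat{M}} \sup_{Q \in \mathcal{Q}} \mathbb{E}_{Q^n}[H(\hat{M}, M)] \ge C \left(\frac{1}{n\pi}\right)^{2/d}$.

Proof. We define $M_0$, $M_1$ and A as in the proof of Theorem 2. Let
$Q_0 = (1-\pi)U + \pi G_0$ and $Q_1 = (1-\pi)U + \pi G_1$. Then $\mathrm{TV}(Q_0, Q_1) = \pi\,\mathrm{TV}(G_0, G_1)$.
Hence $\mathrm{TV}(Q_0, Q_1) \le \pi(G_1(A) - G_0(A)) = \pi G_1(A) \le C\pi\gamma^{d/2}$. From Le Cam's
lemma,

(21) $\sup_{Q \in \mathcal{Q}} \mathbb{E}_{Q^n}[H(\hat{M}, M)] \ge \gamma(1 - \pi\gamma^{d/2})^{2n}$.

Setting $\gamma = (1/(n\pi))^{2/d}$ yields the stated lower bound. □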

Now we consider the upper bound. Let $\hat{Q}_n$ be the empirical measure.
Let $\varepsilon_n = (K \log n / n)^{2/d}$ where K > 0 is a large positive constant. Given
a manifold M and a point $y \in M$ let $S_M(y)$ denote the slab, centered at y,
with size $b_1\sqrt{\varepsilon_n}$ in the d directions corresponding to the tangent space $T_y M$
and size $b_2\varepsilon_n$ in the D − d normal directions to the tangent space. Here, $b_1$
and $b_2$ are small, positive constants. See Figure 2.

Define

$s(M) = \inf_{y \in M} \hat{Q}_n[S_M(y)]$ and $\hat{M}_n = \arg\max_{M \in \mathcal{M}} s(M)$.

In case of ties we take any maximizer.
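The score s(M) can be illustrated in the special case D = 2, d = 1 when the candidate manifolds are circles. The sketch below is only a toy version of the definition: it approximates the infimum over $y \in M$ by a finite grid of anchor points, interprets "size" as the full side length of the slab, and compares two hand-picked candidates rather than maximizing over the whole class. These choices, the value of `eps`, and the function name `slab_score` are assumptions, not part of the paper.

```python
import numpy as np

def slab_score(data, center, radius, eps, b1=1.0, b2=1.0, num_anchors=40):
    """Approximate s(M) for a circle M: smallest empirical mass of a slab S_M(y)."""
    theta = np.linspace(0.0, 2.0 * np.pi, num_anchors, endpoint=False)
    normals = np.column_stack([np.cos(theta), np.sin(theta)])
    tangents = np.column_stack([-np.sin(theta), np.cos(theta)])
    anchors = np.asarray(center, dtype=float) + radius * normals
    fractions = []
    for y, tau, nu in zip(anchors, tangents, normals):
        diff = data - y
        # Slab: side b1*sqrt(eps) along the tangent, side b2*eps along the normal.
        inside = (np.abs(diff @ tau) <= 0.5 * b1 * np.sqrt(eps)) & (
            np.abs(diff @ nu) <= 0.5 * b2 * eps
        )
        fractions.append(inside.mean())
    return min(fractions)

# Clutter data: half the points on the unit circle, half uniform on [-2, 2]^2.
rng = np.random.default_rng(0)
n = 20000
angles = rng.uniform(0.0, 2.0 * np.pi, n // 2)
on_circle = np.column_stack([np.cos(angles), np.sin(angles)])
clutter = rng.uniform(-2.0, 2.0, size=(n // 2, 2))
data = np.vstack([on_circle, clutter])

eps = 1e-3
score_true = slab_score(data, (0.0, 0.0), 1.0, eps)
score_shifted = slab_score(data, (0.5, 0.0), 1.0, eps)
print(score_true, score_shifted)
```

Every slab along the true circle captures a share of the mass of G, whereas a displaced circle has slabs that see only clutter, which is the mechanism used in the proof of Theorem 5.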

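Theorem 5. Let $\xi > 1$ and let $\varepsilon_n = (K \log n / n)^{2/d}$ where K is a large,
positive constant. Then

$\sup_{Q \in \mathcal{Q}} Q^n(H(M_0, \hat{M}_n) > \varepsilon_n) < n^{-\xi}$

and hence

$\sup_{Q \in \mathcal{Q}} \mathbb{E}_{Q^n}(H(M_0, \hat{M}_n)) \le C\varepsilon_n$.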



We will use the following result, which follows from Theorem 7 of Bousquet,
Boucheron and Lugosi (2004). This version of the result is from Chaudhuri
and Dasgupta (2010).

Lemma 6. Let $\mathcal{A}$ be a class of sets with VC dimension V. Let 0 < u < 1
and

$\beta_n = \sqrt{\frac{4}{n}\left[V \log(2n) + \log\frac{8}{u}\right]}$.

Then for all $A \in \mathcal{A}$,

$-\min\{\beta_n\sqrt{\hat{Q}_n(A)},\ \beta_n^2 + \beta_n\sqrt{Q(A)}\} \le Q(A) - \hat{Q}_n(A) \le \min\{\beta_n^2 + \beta_n\sqrt{\hat{Q}_n(A)},\ \beta_n\sqrt{Q(A)}\}$

with probability at least 1 − u.

The set of hyper-rectangles in $\mathbb{R}^D$ (which contains all the slabs) has finite
VC dimension V, say. Hence, we have the following lemma obtained by
setting $u = (1/n)^{\xi}$.

Lemma 7. Let $\mathcal{A}$ denote all hyper-rectangles in $\mathbb{R}^D$. Let
$C = 4[V + \max\{3, \xi\}]$. Then for all $A \in \mathcal{A}$,

(22) $\hat{Q}_n(A) \le Q(A) + \frac{C \log n}{n} + \sqrt{\frac{C \log n}{n} Q(A)}$ and

(23) $\hat{Q}_n(A) \ge Q(A) - \sqrt{\frac{C \log n}{n} Q(A)}$

with probability at least $1 - (1/n)^{\xi}$.

Now we can prove Theorem 5.

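Proof of Theorem 5. Let $M_0$ denote the true manifold. Assume
that (22) and (23) hold. Let $y \in M_0$ and let $A = S_{M_0}(y)$. Note that
$Q(A) = (1-\pi)U(A) + \pi G(A)$. Since $y \in M_0$ and G is singular, the term U(A) is of
lower order and so there exist $0 < c_1 \le c_2 < \infty$ such that, for all large n,

$\frac{c_1 K \log n}{n} = c_1\varepsilon_n^{d/2} \le Q(A) \le c_2\varepsilon_n^{d/2} = \frac{c_2 K \log n}{n}$.

Hence

$\hat{Q}_n(A) \ge Q(A) - \sqrt{\frac{C \log n}{n} Q(A)} \ge \frac{c_1 K \log n}{n} - \frac{\sqrt{C c_2 K}\,\log n}{n} \ge \frac{c_1 K \log n}{2n}$.

Thus $s(M_0) \ge \frac{c_1 K \log n}{2n}$ with high probability.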

Now consider any M for which $H(M_0, M) > \varepsilon_n$. There exists a point
$y \in M$ such that $d(y, M_0) > \varepsilon_n$. It can be seen, since $M \in \mathcal{M}$, that
$S_M(y) \cap M_0 = \emptyset$. [To see this, note that $\Delta(M) \ge \kappa > 0$ implies that the interior of
any ball of radius $\kappa$ tangent to M at y has empty intersection with M and
the slab $S_M(y)$ is strictly contained in such a ball for $b_1$ and $b_2$ small enough
relative to $\kappa$.] Hence

$Q(S_M(y)) = (1-\pi)U(S_M(y)) = c_3\varepsilon_n^{d/2}\varepsilon_n^{D-d} = c_3\left(\frac{K \log n}{n}\right)^{1 + 2(D-d)/d}$.

So, from the previous lemma,

$s(M) = \inf_{x \in M} \hat{Q}_n(S_M(x)) \le \hat{Q}_n(S_M(y))$
$\le c_3\left(\frac{K \log n}{n}\right)^{1 + 2(D-d)/d} + \frac{C \log n}{n} + \sqrt{\frac{c_3 C \log n}{n}}\left(\frac{K \log n}{n}\right)^{1/2 + (D-d)/d} < s(M_0)$,

since D > d and K is large. Let $\mathcal{M}_n = \{M \in \mathcal{M} : H(M_0, M) > \varepsilon_n\}$. We conclude
that

$Q^n(s(M) > s(M_0) \text{ for some } M \in \mathcal{M}_n) < \left(\frac{1}{n}\right)^{\xi}$. □



5. Additive noise. Let us recall the model. Let $\mathcal{M} = \mathcal{M}(\kappa)$. We allow the
manifolds to be noncompact. Fix positive constants $0 < b(\mathcal{M}) < B(\mathcal{M}) < \infty$.
For any $M \in \mathcal{M}$ let $\mathcal{G}(M)$ be the set of distributions G supported on M such
that G has density g with respect to Hausdorff measure on M and such that

(24) $0 < b(\mathcal{M}) \le \inf_{y \in M \cap \mathcal{K}} g(y) \le \sup_{y \in M \cap \mathcal{K}} g(y) \le B(\mathcal{M}) < \infty$,

where $\mathcal{K}$ is a compact set. Let $X_1, X_2, \ldots, X_n \sim G \in \mathcal{G}(M)$, and define

(25) $Y_i = X_i + Z_i$, $i = 1, \ldots, n$,

where $Z_i$ are i.i.d. draws from a distribution $\Phi$ on $\mathbb{R}^D$, and where $\Phi$ is
a standard D-dimensional Gaussian. Let $Q = G \star \Phi$ be the distribution of
each $Y_i$ and $Q^n$ be the corresponding product measure. Let
$\mathcal{Q} = \{G \star \Phi : G \in \mathcal{G}(M), M \in \mathcal{M}\}$.

Since we allow the manifolds to be noncompact, the Hausdorff distance
could be unbounded. Hence we define a truncated loss function,

(26) $L(M, \hat{M}) = H(M \cap \mathcal{K}, \hat{M} \cap \mathcal{K})$.

Theorem 8. For all large enough n,

(27) $\inf_{\hat{M}} \sup_{Q \in \mathcal{Q}} \mathbb{E}_Q[L(M, \hat{M})] \ge \frac{C}{\log n}$.

Fig. 3. The two least favorable manifolds $M_0$ and $M_1$ in the proof of Theorem 8 in the
special case where D = 2 and d = 1.



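Proof. Define $c : \mathbb{R} \to \mathbb{R}$ and $\mathbf{c} : \mathbb{R}^d \to \mathbb{R}^{D-d}$ as follows:
$c(x) = \cos(x/(a\sqrt{\gamma}))$ and $\mathbf{c}(u) = (\prod_{\ell=1}^{d} c(u_\ell), 0, \ldots, 0)^T$. Let
$M_0 = \{(u, \gamma\mathbf{c}(u)) : u \in \mathbb{R}^d\}$ and $M_1 = \{(u, -\gamma\mathbf{c}(u)) : u \in \mathbb{R}^d\}$. See Figure 3 for a picture of $M_0$
and $M_1$ when D = 2, d = 1. Later, we will show that $M_0, M_1 \in \mathcal{M}$.

Let U be a d-dimensional random variable with density $\zeta$ where $\zeta$ is the
d-dimensional standard Gaussian density. Let $\zeta_1$ be a one-dimensional N(0, 1)
density. Define $G_0$ and $G_1$ by $G_0(A) = \mathbb{P}((U, \gamma\mathbf{c}(U)) \in A)$ and
$G_1(A) = \mathbb{P}((U, -\gamma\mathbf{c}(U)) \in A)$.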

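We begin by bounding $\int |q_1 - q_0|^2$. Define the D-cube
$\mathcal{Z} = [-1/(2a\sqrt{\gamma}), 1/(2a\sqrt{\gamma})]^D$. Then, by Parseval's identity, and the fact that $q_j = \phi \star g_j$,

$(2\pi)^D \int |q_1 - q_0|^2 = \int |q_1^* - q_0^*|^2 = \int |\phi^*|^2 |g_1^* - g_0^*|^2$
$= \int_{\mathcal{Z}} |\phi^*|^2 |g_1^* - g_0^*|^2 + \int_{\mathcal{Z}^c} |\phi^*|^2 |g_1^* - g_0^*|^2$
$= I + II$.

Then

$II = \int_{\mathcal{Z}^c} |g_1^*(t) - g_0^*(t)|^2 |\phi^*(t)|^2\, dt \le C \int_{\mathcal{Z}^c} |\phi^*(t)|^2\, dt \le C \left(\int_{1/(2a\sqrt{\gamma})}^{\infty} e^{-t^2}\, dt\right)^D \le \mathrm{poly}(\gamma)\, e^{-D/(4a^2\gamma)}$.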

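Now we bound I. Write $t \in \mathbb{R}^D$ as $(t_1, t_2)$ where $t_1 = (t_{11}, \ldots, t_{1d}) \in \mathbb{R}^d$
and $t_2 = (t_{21}, \ldots, t_{2(D-d)}) \in \mathbb{R}^{D-d}$. Let $c_1(u) = \prod_{\ell=1}^{d} c(u_\ell)$ denote the first
component of the vector-valued function $\mathbf{c}$. We have

$g_0^*(t) - g_1^*(t) = \int \left(e^{i t_1 \cdot u + i t_{21}\gamma c_1(u)} - e^{i t_1 \cdot u - i t_{21}\gamma c_1(u)}\right)\zeta(u)\, du$
$= 2i \int e^{i t_1 \cdot u} \sin(t_{21}\gamma c_1(u))\, \zeta(u)\, du$
$= 2i \sum_{k=0}^{\infty} \frac{(-1)^k t_{21}^{2k+1}\gamma^{2k+1}}{(2k+1)!} \int e^{i t_1 \cdot u} \prod_{\ell=1}^{d} c^{2k+1}(u_\ell)\, \zeta(u)\, du$
$= 2i \sum_{k=0}^{\infty} \frac{(-1)^k t_{21}^{2k+1}\gamma^{2k+1}}{(2k+1)!} \prod_{\ell=1}^{d} \int e^{i t_{1\ell} u_\ell} c^{2k+1}(u_\ell)\, \zeta_1(u_\ell)\, du_\ell$
$= 2i \sum_{k=0}^{\infty} \frac{(-1)^k t_{21}^{2k+1}\gamma^{2k+1}}{(2k+1)!} \prod_{\ell=1}^{d} m_k(t_{1\ell})$,

where

(28) $m_k(t_{1\ell}) = (c^{2k+1}\zeta_1)^*(t_{1\ell}) = (\underbrace{c^* \star c^* \star \cdots \star c^*}_{2k+1 \text{ times}} \star\, \zeta_1^*)(t_{1\ell})$.

Note that

$c^* = \frac{1}{2}\delta_{-1/(a\sqrt{\gamma})} + \frac{1}{2}\delta_{1/(a\sqrt{\gamma})}$,

where $\delta_y$ is a Dirac delta function at y, that is, a generalized function corresponding
to point evaluation at y. For any integer r, if we convolve $c^*$ with
itself r times, we have that

(29) $\underbrace{c^* \star \cdots \star c^*}_{r \text{ times}} = \left(\frac{1}{2}\right)^r \sum_{j=0}^{r} \binom{r}{j} \delta_{a_j}$,

where $a_j = (2j - r)/(a\sqrt{\gamma})$. Thus

(30) $m_k(t_{1\ell}) = \left(\frac{1}{2}\right)^{2k+1} \sum_{j=0}^{2k+1} \binom{2k+1}{j} \zeta_1^*(t_{1\ell} - a_j)$.

Now $\zeta_1^*(t_{1\ell}) = \exp(-t_{1\ell}^2/2)$ and $\zeta_1^*(s) \le 1$ for all $s \in \mathbb{R}$. For $t \in \mathcal{Z}$,
$\zeta_1^*(t_{1\ell} - a_j) \le e^{-1/(2a^2\gamma)}$ and thus $|m_k(t_{1\ell})| \le e^{-1/(2a^2\gamma)}$. Hence,
$\prod_{\ell=1}^{d} |m_k(t_{1\ell})| \le e^{-d/(2a^2\gamma)}$. It follows that for $t \in \mathcal{Z}$ and all small $\gamma$,

$|g_1^*(t) - g_0^*(t)| \le 2\sum_{k=0}^{\infty} \frac{|t_{21}|^{2k+1}\gamma^{2k+1}}{(2k+1)!} \prod_{\ell=1}^{d} |m_k(t_{1\ell})|$
$\le 2e^{-d/(2a^2\gamma)} \sinh(|t_{21}|\gamma) \le e^{-d/(2a^2\gamma)}$.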
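Equation (29) is the Fourier-side statement of the elementary identity $\cos^r(\theta) = 2^{-r}\sum_{j=0}^{r}\binom{r}{j}\cos((2j - r)\theta)$ applied with $\theta = x/(a\sqrt{\gamma})$. A quick numerical check of that identity is sketched below; the parameter `scale` stands for $a\sqrt{\gamma}$ and the function name is hypothetical.

```python
from math import comb

import numpy as np

def cos_power_via_binomial(x, r, scale):
    """Evaluate cos(x / scale)**r through the binomial expansion behind (29)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    # Frequencies a_j = (2j - r) / scale with weights binom(r, j) / 2**r.
    freqs = np.array([(2 * j - r) / scale for j in range(r + 1)])
    weights = np.array([comb(r, j) for j in range(r + 1)]) / 2.0**r
    return (weights * np.cos(np.outer(x, freqs))).sum(axis=1)

x = np.linspace(-3.0, 3.0, 7)
direct = np.cos(x / 0.7) ** 5
expanded = cos_power_via_binomial(x, r=5, scale=0.7)
print(np.max(np.abs(direct - expanded)))
```

For odd r every frequency $a_j$ is a nonzero odd multiple of $1/(a\sqrt{\gamma})$, which is what keeps $t_{1\ell} - a_j$ away from the origin when $t \in \mathcal{Z}$.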



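So,

$I = \int_{\mathcal{Z}} |g_1^*(t) - g_0^*(t)|^2 |\phi^*(t)|^2\, dt \le \int_{\mathcal{Z}} |g_1^*(t) - g_0^*(t)|^2\, dt$
$\le \mathrm{Volume}(\mathcal{Z})\, e^{-d/(a^2\gamma)} = \mathrm{poly}(\gamma)\, e^{-d/(a^2\gamma)}$.

Hence,

$\int |q_1 - q_0|^2 \le I + II \le \mathrm{poly}(\gamma)\, e^{-d/(a^2\gamma)} + \mathrm{poly}(\gamma)\, e^{-D/(4a^2\gamma)}$

(31) $= \mathrm{poly}(\gamma)\, e^{-2w/\gamma}$,

where $2w = \min\{d/a^2, D/(4a^2)\}$.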

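Next we bound $\int |q_1 - q_0|$ so that we can apply Le Cam's lemma. Let $T_\gamma$
be a ball centered at the origin with radius $1/\gamma$. Then, by Cauchy-Schwarz,

$\int_{T_\gamma} |q_1 - q_0| \le \sqrt{\mathrm{Volume}(T_\gamma)}\sqrt{\int |q_1 - q_0|^2}$.

Hence,

$\int |q_1 - q_0| = \int_{T_\gamma} |q_1 - q_0| + \int_{T_\gamma^c} |q_1 - q_0|$
$\le \sqrt{\mathrm{Volume}(T_\gamma)}\sqrt{\int |q_1 - q_0|^2} + \int_{T_\gamma^c} |q_1 - q_0|$
$\le \mathrm{poly}(\gamma)\, e^{-w/\gamma} + \int_{T_\gamma^c} |q_1 - q_0|$.

For all small $\gamma$ we have that $\mathcal{K} \subset T_\gamma$. Hence,

$\int_{T_\gamma^c} |q_1 - q_0| \le \int_{T_\gamma^c}\int \phi(\|y - u\|)\, dG_0(u)\, dy + \int_{T_\gamma^c}\int \phi(\|y - u\|)\, dG_1(u)\, dy \le \mathrm{poly}(\gamma)\, e^{-w/\gamma}$.

Putting this all together, we have that $\int |q_1 - q_0| \le \mathrm{poly}(\gamma)\, e^{-w/\gamma}$.
Now we apply Lemma 1 and conclude that, for every $\gamma > 0$,

$\sup_{Q} \mathbb{E}(L(M, \hat{M})) \ge \frac{\gamma}{8}\left(1 - \mathrm{poly}(\gamma)\, e^{-w/\gamma}\right)^{2n}$.

Set $\gamma \asymp w/\log n$ and conclude that, for all large n,

$\sup_{Q} \mathbb{E}(L(M, \hat{M})) \ge \frac{w}{8e^2}\,\frac{1}{\log n}$.

This concludes the proof of the lower bound except that it remains to show
that $M_0, M_1 \in \mathcal{M}(\kappa)$. Note that the second derivative of the graph function
$\gamma c(u)$ satisfies $|\gamma c''(u)| = a^{-2}|\cos(u/(a\sqrt{\gamma}))|$. Hence, as long
as $a > \sqrt{\kappa}$, $\sup_u |\gamma c''(u)| < 1/\kappa$. It now follows that $M_0, M_1 \in \mathcal{M}(\kappa)$. This
completes the proof. □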
Remark. Consider the special case where D = 2, d = 1 and the manifold
has the special form $\{(u, m(u)) : u \in \mathbb{R}\}$ for some function $m : \mathbb{R} \to \mathbb{R}$. In
this case, estimating the manifold is like estimating a regression function
with errors in variables. (More on this in Section 6.) The rate obtained
for estimating a regression function with errors in variables under these
conditions [Fan and Truong (1993)] is 1/log n in agreement with our rate.
However, the proof technique is not quite the same as we explain in Section 6.

Remark. The proof of the lower bound is similar to other lower bounds
in deconvolution problems. There is an interesting technical difference, however.
In standard deconvolution, we can choose $G_0$ and $G_1$ so that
$g_1^*(t) - g_0^*(t)$ is zero in a large neighborhood around the origin. This simplifies the
proof considerably. It appears we cannot do this in the manifold case since $G_0$
and $G_1$ have different supports.

Next we construct an upper bound. We use a standard deconvolution
density estimator $\hat{g}$ (even though G has no density), and then we threshold
this estimator.



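Theorem 9. Fix any $0 < \delta < 1/2$. Let $h = 1/\sqrt{\log n}$. Let $\lambda_n$ be such that

where $k \ge d/(2\delta)$, C' is defined in Lemma 11 and C'' and L are defined in
Lemma 12. Define $\hat{M} = \{y : \hat{g}(y) > \lambda_n\}$ where $\hat{g}$ is defined in (34). Then for
all large n,

(32) $\inf_{\hat{M}} \sup_{Q \in \mathcal{Q}} \mathbb{E}_Q[L(M, \hat{M})] \le C\left(\frac{1}{\log n}\right)^{(1-\delta)/2}$.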

Let us now define the estimator in more detail. Define $\psi_k(y) = \mathrm{sinc}^{2k}(y/(2k))$.
By elementary calculations, it follows that



2k^ 

where Br = J ^ - — -k where J = jj. The following properties of ijjk 

r times 

and V'fc follow easily: 

(1) The support of $\psi_k^*$ is $[-1, 1]$.

(2) $\psi_k \ge 0$ and $\psi_k^* \ge 0$.

(3) $\int \psi_k^*(t)\, dt = \psi_k(0) = 1$.

(4) $\psi_k^*$ and $\psi_k$ are spherically symmetric.

(5) \My)\ < l/{{2kf^\y\''^) for all |y| > ^/{2k). 

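Abusing notation somewhat, when u is a vector, we take $\psi_k(u) = \psi_k(\|u\|)$.

Define

(33) $\hat{g}^*(t) = \frac{\hat{q}^*(t)}{\phi^*(t)}\,\psi_k^*(ht)$,

where $\hat{q}^*(t) = \frac{1}{n}\sum_{i=1}^{n} e^{i t \cdot Y_i}$ is the empirical characteristic function. Now
define

(34) $\hat{g}(y) = \left(\frac{1}{2\pi}\right)^D \int e^{-i t \cdot y}\, \frac{\psi_k^*(ht)\,\hat{q}^*(t)}{\phi^*(t)}\, dt$.

Let $\bar{g}(y) = \mathbb{E}(\hat{g}(y))$.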
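The structure of the estimator can be illustrated in one dimension: divide the empirical characteristic function by the Gaussian characteristic function, damp with a weight supported on $|t| \le 1/h$, and invert. The sketch below is a toy version under stated assumptions: D = 1, standard Gaussian noise, a Riemann-sum inversion, and the weight $(1 - (ht)^2)^3$ used as a simple compactly supported stand-in for $\psi_k^*(ht)$ (it is not the paper's kernel). The function name is hypothetical.

```python
import numpy as np

def deconvolution_estimate(y_obs, y_grid, h=None, num_t=401):
    """One-dimensional sketch of the Fourier deconvolution estimator."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_grid = np.atleast_1d(np.asarray(y_grid, dtype=float))
    n = y_obs.size
    if h is None:
        h = 1.0 / np.sqrt(np.log(n))  # bandwidth used in Theorem 9
    t = np.linspace(-1.0 / h, 1.0 / h, num_t)
    dt = t[1] - t[0]
    # Empirical characteristic function evaluated on the frequency grid.
    ecf = np.exp(1j * np.outer(t, y_obs)).mean(axis=1)
    # Stand-in weight supported on |t| <= 1/h (assumption, not psi_k^*).
    weight = np.clip(1.0 - (h * t) ** 2, 0.0, None) ** 3
    phi_star = np.exp(-t**2 / 2.0)  # characteristic function of N(0, 1)
    g_star = ecf * weight / phi_star
    # Inverse transform by a Riemann sum; the weight vanishes at both endpoints.
    integrand = np.exp(-1j * np.outer(y_grid, t)) * g_star
    return np.real(integrand.sum(axis=1)) * dt / (2.0 * np.pi)

# Singular example: X is a point mass at 0, so Y = Z ~ N(0, 1).
rng = np.random.default_rng(0)
y_obs = rng.standard_normal(2000)
print(deconvolution_estimate(y_obs, [0.0, 3.0]))
```

Even though the point mass has no density, the output is large near the support point and small away from it, which is the behavior that thresholding at $\lambda_n$ is meant to exploit.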

Lemma 10. For all y G 

Proof. Let $\psi_{k,h}(x) = h^{-D}\psi_k(x/h)$. Hence, $\psi_{k,h}^*(t) = \psi_k^*(th)$. Now,
\2ttJ J (l)*{t) 
27tJ J (t)*{t) 



27r 



e"** yrk{th)g*{t)dt 



-I \D p / 1 \ ^ 



2^ , , e-** 'rk,hm{t)dt=l- ) / e-' ^ {g * ^l^k,hnt) dt 



{y-u)dG{u) 



\ I I I y ~ 



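Lemma 11. We have that $\inf_{y \in M \cap \mathcal{K}} \bar{g}(y) \ge C' h^{d-D}$.

Proof. Choose any $x \in M \cap \mathcal{K}$ and let $B = B(x, Ch)$. Note that
$G(B) \ge b(\mathcal{M})\, c\, h^d$. Hence,

$\bar{g}(x) = (2\pi)^{-D} h^{-D} \int \psi_k\left(\frac{x - u}{h}\right) dG(u)$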
Lemma 12. Fix $0 < \delta < 1/2$. Suppose that $k \ge d/(2\delta)$. Then,

(35) $\sup\{\bar{g}(y) : y \in \mathcal{K},\ d(y, M) > Lh^{1-\delta}\} \le C'' L^{-2k}\left(\frac{1}{h}\right)^{D-d}$.

Proof. Let y be such that $d(y, M) > Lh^{1-\delta}$. For integer $j \ge 1$, define
$A_j = [B(y, (j+1)Lh^{1-\delta}) - B(y, jLh^{1-\delta})] \cap M \cap \mathcal{K}$.



Then 



1 f I 2kh 



< 



(*) <C( 

<C( L-^'^/i'^ 

/ 1 \ D-d 

<C"L-^H ^ 



where equation (*) follows because G is a probability measure and $\sum_j j^{-2k} < \infty$,
and equation (**) follows because $2k\delta \ge d$. □

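Now define $\Gamma_n = \sup_y |\hat{g}(y) - \bar{g}(y)|$.

Lemma 13. Let $h = 1/\sqrt{\log n}$, and let $\xi \ge 1$. Then, for large n,

(36) $\Gamma_n \le \left(\frac{1}{\sqrt{\log n}}\right)^{4k+4-D}$

on an event $A_n$ of probability at least $1 - n^{-\xi}$.

Proof. We proceed as in Theorem 2.3 of Stefanski (1990). Note that

(37) $\hat{g}(y) - \bar{g}(y) = \left(\frac{1}{2\pi}\right)^D \int e^{-i t \cdot y}\, \frac{\psi_k^*(ht)}{\phi^*(t)}\, (\hat{q}^*(t) - q^*(t))\, dt$,

and also note that the integrand is 0 for $\|t\| > 1/h$. So

(38) $\sup_y |\hat{g}(y) - \bar{g}(y)| \le \frac{\Delta_n}{(2\pi)^D} \int_{\|t\| \le 1/h} \frac{|\psi_k^*(ht)|}{|\phi^*(t)|}\, dt$,

where $\Delta_n = \sup_{\|t\| < 1/h} |\hat{q}^*(t) - q^*(t)|$.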

For D = 1, it follows from Theorem 4.3 of Yukich (1985) that



(39) Q"(A„ > Ae) < 4iV(e) exp 



ne 



8iV(6)expf-^Y 



V 96 y 



8 + 4e/3^ 

where N{e) is the bracketing number of the set of complex exponentials, 
which is given by N^e) = 1 + MM^^ and is defined by Q(||y|| > JVQ < 
e/4. By a similar argument, we have that in D dimensions, 

^ +8Af(e)expf-^Y 



(40) sup Q"(A„ > 4e) < 4iV(e) exp 
QeS„ 

where now 

(41) N{e) = C 1 + 



ne 



+ 4e/3 



96 J 



24M,r„ 



-(D-l) 



and Ms is defined by supggg^ Qi\\y\\ > M^) < e/4. Note that = 0(1). It 

except on a set of probability where ^ can 



follows that An < 
be made arbitrarily large by taking C large. 

Now, note that ipl{ht) / cj)* (t) is a spherically symmetric function i?(p||). 
Hence, 



i 



t\\<l/h 



dt 



Js=0 



where the last result follows from Lemma 3.1 in Stefanski (1990) using pa- 
rameters 6 = 2, 7 = 1/2, r = 2A; + 2, (3 = D — 1, with X = h. The value of r 
follows from the definition of ip^.. The result now follows by combining this 
bound with (38). □ 

Now we can complete the proof of the upper bound. 

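Proof of Theorem 9. On the event $A_n$ where $\Gamma_n \le (1/\sqrt{\log n})^{4k+4-D}$
(defined in the previous lemma), we have

$\inf_{y \in M \cap \mathcal{K}} \hat{g}(y) \ge \inf_{y \in M \cap \mathcal{K}} \bar{g}(y) - \Gamma_n \ge C'\left(\frac{1}{h}\right)^{D-d} - \left(\frac{1}{\sqrt{\log n}}\right)^{4k+4-D}$
$\ge (C'/2)\left(\frac{1}{h}\right)^{D-d} > \lambda_n$.

This implies that $M \cap \mathcal{K} \subset \hat{M} \cap \mathcal{K}$.
Next, we have

$\sup_{d(y,M) > Lh^{1-\delta}} \hat{g}(y) \le \sup_{d(y,M) > Lh^{1-\delta}} \bar{g}(y) + \Gamma_n \le C'' L^{-2k}\left(\frac{1}{h}\right)^{D-d} + \left(\frac{1}{\sqrt{\log n}}\right)^{4k+4-D} < \lambda_n$

for large enough L. This implies that

$\{y : y \in \mathcal{K} \text{ and } d(y, M) > Lh^{1-\delta}\} \cap \hat{M} = \emptyset$.

Therefore, on $A_n$, $L(M, \hat{M}) \le C\left(\frac{1}{\log n}\right)^{(1-\delta)/2}$ and hence,

$\mathbb{E}(L(M, \hat{M})) = \mathbb{E}(L(M, \hat{M})1_{A_n}) + \mathbb{E}(L(M, \hat{M})1_{A_n^c})$
$\le C\left(\frac{1}{\log n}\right)^{(1-\delta)/2} + n^{-\xi} \le C\left(\frac{1}{\log n}\right)^{(1-\delta)/2}$

and the theorem is proved. □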

Remark. Again, the proof of the upper bound is similar to proofs used 
in other deconvolution problems. But once more, there are interesting differences.
In particular, the density estimator $\hat{g}$ is not estimating any underlying
density since the measure G is singular and thus does not have a density. 
Hence, the usual bias calculation is meaningless. 

Remark. Note that $\hat{M}$ is a set not a manifold; if desired, we can replace
$\hat{M}$ with any manifold in $\{M \in \mathcal{M} : M \subset \hat{M}\}$, and then the estimator
is a manifold and the rate is the same.

Remark. The upper bound is slightly slower than the lower bound.
The rate is consistent with the results in Caillerie et al. (2011) who show
that $\mathbb{E}(W_2(\hat{G}, G)) \le C/\sqrt{\log n}$ where $W_2$ is the Wasserstein distance. In the
special case where the manifold has the form $\{(u, m(u)) : u \in \mathbb{R}\}$ for some
function m, the problem can be viewed as nonparametric regression with
measurement error; see Section 6. In this special case, we can use the deconvolution
kernel regression estimator in Fan and Truong (1993) which achieves
the rate 1/log n. We do not know of any estimator in the general case that
achieves the rate 1/log n, although we conjecture that the following estimator
might have a better rate: let $(\hat{M}, \hat{G})$ minimize $\sup_{\|t\| \le T_n} |\hat{q}^*(t) - q^*_{M,G}(t)|$
where $T_n = O(\sqrt{\log n})$. In any case, as with all Gaussian deconvolution problems,
the rate is very slow, and the difference between 1/log n and $1/\sqrt{\log n}$
is not of practical consequence.
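To see how slow both rates are, one can simply tabulate them. The short sketch below prints 1/log n and $1/\sqrt{\log n}$ for a few sample sizes; the helper name `log_rates` is hypothetical.

```python
import math

def log_rates(n):
    """Return the pair (1 / log n, 1 / sqrt(log n))."""
    log_n = math.log(n)
    return 1.0 / log_n, 1.0 / math.sqrt(log_n)

for n in (10**3, 10**6, 10**9, 10**12):
    fast, slow = log_rates(n)
    print(f"n = {n:>14d}   1/log n = {fast:.4f}   1/sqrt(log n) = {slow:.4f}")
```

Squaring the sample size only halves 1/log n, so neither rate delivers small errors at realistic sample sizes.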

6. Singular deconvolution. Estimating a manifold under additive noise 
is related to deconvolution. It is also related to regression with errors in 
variables. The purpose of this section is to explain the connections between 
the problems. 

6.1. Relationship to density deconvolution. Recall that the model is
$Y = X + Z$ where $X \sim G$, G is supported on a manifold M and $Z \sim \Phi$. G is
a singular measure supported on the d-dimensional manifold M.

Now consider a somewhat simpler model: suppose again that $Y_i = X_i + Z_i$,
but suppose that X has a density g on $\mathbb{R}^D$ (instead of being supported on
a manifold). All three distributions Q, G and $\Phi$ have D-dimensional support
and $Q = G \star \Phi$. The problem of recovering the density g of X from $Y_1, \ldots, Y_n$
is the usual density deconvolution problem. A key reference is Fan (1991).

Most of the existing literature on deconvolution assumes that X and Y 
have the same support, or at least that the supports have the same dimen- 
sion; an exception is Koltchinskii (2000). Manifold learning may be regarded 
as the problem of deconvolution for singular measures. 

It is instructive to compare the least favorable pair used for proving the
lower bounds in the ordinary case versus the singular case. Figure 4 shows
a typical least favorable pair for proving a lower bound in ordinary deconvolution.
The top left plot is a density $g_0$ and the top right plot is a density $g_1$
which is a perturbed version of $g_0$. The $L_1$ distance between the densities is $\varepsilon$.
The bottom plots are $q_0 = \int \phi(y - x) g_0(x)\, dx$ and $q_1 = \int \phi(y - x) g_1(x)\, dx$.
These densities are nearly indistinguishable, and, in fact, their total variation
distance is of order $e^{-1/\varepsilon}$. Of course, these distributions have the same
support and hence such a least favorable pair will not suffice for proving
lower bounds in the manifold case where we will need two densities with
different support.

Fig. 4. A typical least favorable pair for proving lower bounds in ordinary
deconvolution. The top left plot is a density $g_0$ and the top right plot is
a density $g_1$ which is a perturbed version of $g_0$. The $L_1$ distance
between the densities is $\varepsilon$. The bottom plots are
$q_0 = \int \phi(y - x) g_0(x)\,dx$ and $q_1 = \int \phi(y - x) g_1(x)\,dx$.
These densities are nearly indistinguishable and, in fact, their total
variation distance is $e^{-1/\varepsilon}$.

Figure 5 shows the type of least favorable pair we used for manifold
learning. The top two plots do not show the densities; rather they show the
support of the densities. The distribution $g_0$ is uniform on the circle in
the top left plot. The distribution $g_1$ is uniform on the perturbed circle
in the top right plot. The Hausdorff distance between the supports of the
densities is $\varepsilon$. The bottom plots are
$q_0 = \int \phi(y - x) g_0(x)\,dx$ and $q_1 = \int \phi(y - x) g_1(x)\,dx$.
Again, these densities are nearly indistinguishable, and, in fact, their
total variation distance is $e^{-1/\varepsilon}$. In this case, however,
$g_0$ and $g_1$ have different supports.
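A rough numerical illustration of the singular case is sketched below. It is
a toy under stated assumptions, not the pair plotted in Figure 5: the
perturbed circle has radius $1 + \varepsilon\cos(k\theta)$ with
$\varepsilon = 0.1$ and $k = 10$, both supports are sampled uniformly in the
angle $\theta$ (which is only approximately uniform on the perturbed curve),
the noise is isotropic Gaussian with unit variance, and the helper names are
hypothetical. It computes the Hausdorff distance between the two supports and
the total variation distance between the two smoothed densities on a grid.

```python
import numpy as np

def curve(radius, theta):
    """Points (r cos t, r sin t) as an (n, 2) array."""
    return np.column_stack((radius * np.cos(theta), radius * np.sin(theta)))

def hausdorff(a, b):
    """Hausdorff distance between two finite point sets."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def smoothed_density(points, grid, sigma):
    """Average of isotropic Gaussian kernels centred at the support points."""
    sq = ((grid ** 2).sum(axis=1)[:, None]
          + (points ** 2).sum(axis=1)[None, :]
          - 2.0 * grid @ points.T)
    kernel = np.exp(-0.5 * sq / sigma ** 2) / (2.0 * np.pi * sigma ** 2)
    return kernel.mean(axis=1)

# Assumed illustration parameters (not taken from the paper).
eps, k, sigma = 0.1, 10, 1.0
theta = np.linspace(0.0, 2.0 * np.pi, 400, endpoint=False)
m0 = curve(np.ones_like(theta), theta)              # unit circle
m1 = curve(1.0 + eps * np.cos(k * theta), theta)    # perturbed circle

# Evaluation grid for the smoothed densities q0 and q1.
axis = np.linspace(-5.0, 5.0, 81)
cell = (axis[1] - axis[0]) ** 2
grid = np.array(np.meshgrid(axis, axis)).reshape(2, -1).T

q0 = smoothed_density(m0, grid, sigma)
q1 = smoothed_density(m1, grid, sigma)

h = hausdorff(m0, m1)
tv = 0.5 * np.abs(q0 - q1).sum() * cell
print(f"Hausdorff distance between supports: {h:.3f}")
print(f"TV distance between smoothed densities: {tv:.3e}")
```

The supports are separated by about $\varepsilon$ in Hausdorff distance, yet
the noisy observations from the two models are far harder to tell apart than
that separation suggests.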

6.2. Relationship to regression with measurement error. We can also relate
the manifold estimation problem to nonparametric regression with
measurement error. Suppose that

(42)    $U_i = X_i + Z_{2i}$,
        $Y_i = m(X_i) + Z_{1i}$,

and we want to estimate the regression function $m$. If we observe
$(X_1, Y_1), \ldots, (X_n, Y_n)$, then this is a standard nonparametric
regression problem. But if we only observe $(U_1, Y_1), \ldots, (U_n, Y_n)$,
then this is the usual nonparametric regression with measurement error
problem. The rates of convergence are similar to deconvolution. Indeed, Fan
and Truong (1993) have an argument that converts nonparametric regression
with measurement error into a density deconvolution problem. Let us see how
this relates to manifold learning.

Fig. 5. The type of least favorable pair needed for proving lower bounds in
manifold learning. The distribution $g_0$ is uniform on the circle in the top
left plot. The distribution $g_1$ is uniform on the perturbed circle in the
top right plot. The Hausdorff distance between the supports of the densities
is $\varepsilon$. The bottom plots are heat maps of
$q_0 = \int \phi(y - x) g_0(x)\,dx$ and $q_1 = \int \phi(y - x) g_1(x)\,dx$.
These densities are nearly indistinguishable and, in fact, their total
variation distance is $e^{-1/\varepsilon}$.

Suppose that $D = 2$ and $d = 1$. Further, suppose that the manifold is
function-like, meaning that the manifold is a curve of the form
$M = \{(u, m(u)) : u \in \mathbb{R}\}$ for some function $m$. Then each $Y_i$
can be written in the form

$$\begin{pmatrix} Y_{i1} \\ Y_{i2} \end{pmatrix}
  = \begin{pmatrix} U_i \\ m(U_i) \end{pmatrix}
  + \begin{pmatrix} Z_{i1} \\ Z_{i2} \end{pmatrix},$$

which is exactly of the form (42). Let $\mathcal{Q}$ be all such
distributions obtained this way with $|m''(u)| < 1/\kappa$. However, this
only holds when the manifold has the function-like form. Moreover, the lower
bound argument in Fan and Truong (1993) cannot directly be transferred to the
manifold setting as we now explain.

In our lower bound proof, we defined a least favorable pair $q_0$ and $q_1$
for the distribution of $Y$ as follows. Take
$M_0 = \{(u, 0) : u \in \mathbb{R}\}$ and
$M_1 = \{(u, m(u)) : u \in \mathbb{R}\}$. [In fact, we used $(u, m(u))$ and
$(u, -m(u))$, but the present discussion is clearer if we use $(u, 0)$ and
$(u, m(u))$.] Let $Y = (Y_1, Y_2)$. For $M_0$, the distribution $q_0$ for $Y$
is based on

$$\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}
  = \begin{pmatrix} U \\ 0 \end{pmatrix}
  + \begin{pmatrix} Z_1 \\ Z_2 \end{pmatrix}.$$

The density of $(U, Y_2)$ is $f_0(u, y_2) = \zeta(u)\phi(y_2)$ where $\zeta$
is some density for $U$. Then

$$q_0(y_1, y_2) = f_0 \star \Phi = \int f_0(y_1 - z_1, y_2)\,d\Phi(z_1),$$

where the convolution symbol, here and in what follows, refers to
convolution only over $U + Z_1$.

Now let $q_1(y_1, y_2)$ denote the distribution of $Y$ in the model

$$\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}
  = \begin{pmatrix} U \\ m(U) \end{pmatrix}
  + \begin{pmatrix} Z_1 \\ Z_2 \end{pmatrix}.$$

This generates the least favorable pair $q_0$, $q_1$ used in our proof
(restricted to this special case).

The least favorable pair used by Fan and Truong is different in a subtle
way. The first distribution $q_0$ is the same. The second, which we will
denote $w_1$, is constructed as follows. Let

$$w_1(y_1, y_2) = f_1 \star \Phi = \int f_1(y_1 - z_1, y_2)\,d\Phi(z_1),$$

where the convolution is only over $U$,

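$$f_1(u, y_2) = f_0(u, y_2) + \zeta(u)\,m(u)\,h_0(y_2),$$

where $\zeta(u)\,m(u)$ is a rescaled perturbation function such as a cosine,
and $h_0$ is chosen so that $\int h_0(y_2)\,dy_2 = 0$ and
$\int y_2 h_0(y_2)\,dy_2 = 1$. Now we show that
$w_1(y_1, y_2) \neq q_1(y_1, y_2)$. In fact, $w_1$ is not in $\mathcal{Q}$.
Note that

$$w_1(y_1, y_2) = f_1 \star \Phi
  = q_0(y_1, y_2) + h_0(y_2) \int \zeta(y_1 - z_1)\,m(y_1 - z_1)\,d\Phi(z_1).$$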

Now,

$$q_1(y_2 \mid u) = \phi(y_2 - m(u)),$$

but

$$f_1(y_2 \mid u) = \frac{f_1(u, y_2)}{f_1(u)} = \phi(y_2) + m(u)\,h_0(y_2).$$

These both have mean $m(u)$ but the distributions are different. Indeed, the
marginals $w_1(y_2)$ and $q_1(y_2)$ are different. In fact,

$$w_1(y_2) = q_0(y_2) + c\,h_0(y_2)$$

for some $c$. This is not in our class because it is not of the form
$\phi(y_2 - m(u))$. Hence, $w_1$ is not in our class $\mathcal{Q}$: it does
not correspond to drawing a point on a manifold and adding noise.
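The claim that the two conditional densities share the mean $m(u)$ yet differ
can be verified numerically. The sketch below fixes one value $m(u) = 0.1$
and uses a hypothetical $h_0(y) = \tfrac{3}{2}\,y$ on $[-1, 1]$ (zero
elsewhere), chosen only because it satisfies $\int h_0 = 0$ and
$\int y\,h_0 = 1$ and keeps $\phi + m(u) h_0$ nonnegative for this small
$m(u)$; the source does not specify $h_0$ beyond those two constraints.

```python
import numpy as np

def phi(y):
    """Standard normal density."""
    return np.exp(-0.5 * y ** 2) / np.sqrt(2.0 * np.pi)

def h0(y):
    """Hypothetical perturbation: integrates to 0, and y*h0(y) integrates to 1."""
    return np.where(np.abs(y) <= 1.0, 1.5 * y, 0.0)

dy = 1e-3
y = np.arange(-10.0, 10.0 + dy / 2, dy)
m_u = 0.1  # assumed value of m(u) at a fixed u

q1_cond = phi(y - m_u)             # manifold model: shifted Gaussian
f1_cond = phi(y) + m_u * h0(y)     # Fan-Truong style additive perturbation

mass_f1 = f1_cond.sum() * dy
mean_q1 = (y * q1_cond).sum() * dy
mean_f1 = (y * f1_cond).sum() * dy
l1_gap = np.abs(q1_cond - f1_cond).sum() * dy

print(f"mass of f1(.|u): {mass_f1:.4f}")
print(f"means: q1 {mean_q1:.4f}, f1 {mean_f1:.4f}")
print(f"L1 distance between the two conditionals: {l1_gap:.4f}")
```

Matching the conditional mean is enough for the regression lower bound, but
the perturbed density is not a location shift of $\phi$, which is what
membership in $\mathcal{Q}$ requires.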

The point is that manifold learning reduces to nonparametric regression
with measurement error only in the special case that the manifold is
function-like. And even in this case, the proofs of the bounds are somewhat
different from the usual proofs.

7. Discussion. The purpose of this paper is to establish minimax bounds 
on estimating manifolds. The estimators used to prove the upper bounds 
are theoretical constructions for the purposes of the proofs. They are not 
practical estimators. 

There is a large literature on methodology for estimating manifolds. How- 
ever, these estimators are not likely to be optimal except under stringent 
conditions. In current work we are trying to bridge the gap between the 
theory and the methodology. 

Probably the most realistic noise condition is the additive model. In this 
case, we are dealing with a singular deconvolution problem. The upper bound 
used deconvolution techniques. Such methods require that the noise
distribution is known (or is at least restricted to some narrow class of
distributions).
This seems unrealistic in real problems. A more realistic goal is to estimate 
some proxy manifold M* that, in some sense, approximates M. We are 
currently working on such techniques. 